In [1]:
# imports

# for dataframes
import pandas as pd
import pandasql as ps
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# for converting time stuff
from datetime import date
import calendar

# for display
from IPython.display import Image, display, IFrame

CitiBike Dock Placement Optimization

Document Navigation

  1. Introduction
    1.1 Objective
    1.2 Definitions
    1.3 Data

  2. Gathering Files
  3. Feature Engineering
  4. Learning the System (overall)
  5. Visualization of Utilization for different time intervals
  6. Utilization by docks / commuting times (where empty/busy docks are and when; if possible, add where the journeys are coming from based on the one month of trip data)

1. Introduction

Objective:

Place CitiBike's next XX docks in a way that reduces the number of minutes the system spends at "critical points"

Definitions:

  • Utilization: Available Bikes / Dock Capacity
  • Critical points: When a dock is at 0% or 100% utilization
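These two definitions can be sketched directly in pandas; a minimal illustration with made-up dock readings (the column names mirror the avail_bikes / tot_docks fields used later in this notebook):

```python
import pandas as pd

# toy dock-status readings (hypothetical values)
docks = pd.DataFrame({
    'dock_id':     [72, 79, 82],
    'avail_bikes': [0, 12, 24],
    'tot_docks':   [30, 24, 24],
})

# Utilization = Available Bikes / Dock Capacity
docks['utilization'] = docks['avail_bikes'] / docks['tot_docks']

# Critical point = dock at 0% or 100% utilization
docks['critical'] = docks['utilization'].isin([0.0, 1.0])

print(docks)
```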

Data:

2. Gathering Files

This will serve as our raw database for analysis.

Notes:

  • Relevant month files were saved locally
  • A total of 6 months are pulled, but due to missing data we only observe 142 total days
    • the following months start later than the 1st in the data
      • June: 8th
      • August: 10th
      • Sept: 9th
      • Oct: 10th
  • Months May - October were chosen due to the seasonal effect on CitiBike usage
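As an aside, the monthly files below could also be gathered with a glob pattern rather than six separate read_csv calls. A self-contained sketch, with two tiny stand-in files replacing the real months (the filename pattern is assumed from the paths below):

```python
import glob
import os
import tempfile

import pandas as pd

# stand-in monthly files (whitespace-separated, like the raw dock data)
tmpdir = tempfile.mkdtemp()
for month, rows in [('201705', 'a b\n1 2\n3 4\n'), ('201706', 'a b\n5 6\n')]:
    with open(os.path.join(tmpdir, 'bikeshare_nyc_raw_%s.csv' % month), 'w') as f:
        f.write(rows)

# glob all monthly files and concatenate them in one pass;
# ignore_index gives one clean 0..n-1 index in the same step
paths = sorted(glob.glob(os.path.join(tmpdir, 'bikeshare_nyc_raw_2017*.csv')))
df_all = pd.concat([pd.read_csv(p, sep=r'\s+') for p in paths], ignore_index=True)

print(df_all)
```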
In [4]:
# import local files
# (error_bad_lines is deprecated in newer pandas; on_bad_lines='skip' is its replacement)
df_05 = pd.read_csv('/Users/SilviaRuiz/dsi_tasks/capstone/citibike/docks_data/bikeshare_nyc_raw_201705.csv', sep=r'\s+', error_bad_lines=False)
df_06 = pd.read_csv('/Users/SilviaRuiz/dsi_tasks/capstone/citibike/docks_data/bikeshare_nyc_raw_201706.csv', sep=r'\s+', error_bad_lines=False)
df_07 = pd.read_csv('/Users/SilviaRuiz/dsi_tasks/capstone/citibike/docks_data/bikeshare_nyc_raw_201707v2.csv', sep=r'\s+', error_bad_lines=False)
df_08 = pd.read_csv('/Users/SilviaRuiz/dsi_tasks/capstone/citibike/docks_data/bikeshare_nyc_raw_201708v2.csv', sep=r'\s+', error_bad_lines=False)
df_09 = pd.read_csv('/Users/SilviaRuiz/dsi_tasks/capstone/citibike/docks_data/bikeshare_nyc_raw_201709v2.csv', sep=r'\s+', error_bad_lines=False)
df_10 = pd.read_csv('/Users/SilviaRuiz/dsi_tasks/capstone/citibike/docks_data/bikeshare_nyc_raw_201710v2.csv', sep=r'\s+', error_bad_lines=False)
In [5]:
# concatenating all months (DataFrame.append was removed in pandas 2.0)
df_all = pd.concat([df_05, df_06, df_07, df_08, df_09, df_10])
In [6]:
# reindexing
df_all.reset_index(inplace=True, drop=True) # inplace modifies df_all directly;
                                            # drop discards the old per-month indexes

3. Feature Engineering

Summary of features added:

  • Utilization
  • Military time for hours [mil_test]
  • Military time with minutes as a decimal (for plotting) [mil_min_test]
  • Separate columns for year, month, day (derived from date, and feeders to day_of_week) [year] [month] [day]
  • Day of the week related to that date
  • Neighborhood breaks
    • scrape from: [insert url]
In [7]:
# --------- Utilization
df_all['utilization'] = df_all['avail_bikes'] / df_all['tot_docks']

# zero-capacity readings make utilization inf/NaN; treat those rows as full
df_all.loc[~np.isfinite(df_all['utilization']), 'utilization'] = 1
In [8]:
# --------- Adding hours in military and minutes

# step 1: Create default column

df_all['mil_test'] = df_all['hour']

# step 2: Change the default condition based on a rule (pm first)

df_all.loc[(df_all['pm'] == 1) & # filter for pm
       (df_all['hour'] <= 11), # hours 1 through 11 (12 pm should stay as 12)
       'mil_test'] = df_all['hour'] + 12 # add 12 to all the pm hours between 1pm and 11pm

# step 3: Change the default condition based on a rule (midnight)

df_all.loc[(df_all['pm'] == 0) & # filtering for am
       (df_all['hour'] == 12), # we just want to take care of midnight
       'mil_test'] = 24 

# step 4: Add in minutes

df_all['mil_min_test'] = df_all['mil_test'] + df_all['minute'] / 60
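The four steps above can be restated as one small pure function, which makes the edge cases (noon, midnight) easy to check; hour and pm are the raw 12-hour value and 0/1 flag from the data:

```python
def to_military(hour, minute, pm):
    """Convert a 12-hour reading to decimal military time (midnight -> 24)."""
    mil = hour
    if pm == 1 and hour <= 11:   # 1pm-11pm become 13-23; 12pm stays 12
        mil = hour + 12
    if pm == 0 and hour == 12:   # midnight is coded as 24, as in step 3
        mil = 24
    return mil + minute / 60

print(to_military(5, 0, 1))    # 5:00pm -> 17.0
print(to_military(12, 30, 0))  # 12:30am -> 24.5
```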
In [9]:
# --------- Getting day of the week

# step 5: Separate year / month / day columns

df_all['year'] = df_all.date.str[0:2]
df_all['month'] = df_all.date.str[3:5]
df_all['day'] = df_all.date.str[6:8]

# convert to int
df_all['year'] = pd.to_numeric(df_all['year'])
df_all['month'] = pd.to_numeric(df_all['month'])
df_all['day'] = pd.to_numeric(df_all['day'])

# step 6a: Create a weekday list and append to the dataframe

year_array = np.array(df_all['year'])
month_array = np.array(df_all['month'])
day_array = np.array(df_all['day'])


weekday_array = []

for i in range(len(year_array)):
    new_day = date(year_array[i], month_array[i], day_array[i])
    new_dayname = calendar.day_name[new_day.weekday()]
    weekday_array.append(new_dayname)

# append to dataframe once the list is built
df_all['day_of_week'] = weekday_array
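The loop above works, but the same column can be derived without Python-level iteration; a sketch using pd.to_datetime (assuming the yy-mm-dd date format sliced above):

```python
import pandas as pd

df = pd.DataFrame({'date': ['17-06-13', '17-07-03']})

# parse the string once, then read the weekday name off the datetime
parsed = pd.to_datetime(df['date'], format='%y-%m-%d')
df['day_of_week'] = parsed.dt.day_name()

print(df)
```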
In [10]:
# --------- Formatting date and year so they play nicely with SQL

# prepending "20" to each date
# (loop variable renamed so it doesn't shadow datetime.date imported above)
date_list = df_all['date']

new_date_list = []

for d in date_list:
    new_date_list.append('20' + d)

# doing the same with the year column
year_list = df_all['year']

new_year_list = []

for y in year_list:
    new_year_list.append('20' + str(y))
In [11]:
df_all['date'] = new_date_list
df_all['year'] = new_year_list

df_all['year'] = pd.to_numeric(df_all['year'])

Exporting

In [12]:
df_all.to_pickle('/Users/SilviaRuiz/dsi_tasks/capstone/citibike/pickle_files/pickle_df_all.pkl')
In [ ]:
df_all.to_csv(path_or_buf='/Users/SilviaRuiz/dsi_tasks/capstone/citibike/df_all.csv')

Importing

In [3]:
# base dataframe for the entire analysis
df_all = pd.read_pickle('/Users/SilviaRuiz/dsi_tasks/capstone/citibike/pickle_files/pickle_df_all.pkl')
In [ ]:
# adding weekday/weekend flag & commuting times (morning/evening) flag
df_all_weekday_times = pd.read_pickle('/Users/SilviaRuiz/dsi_tasks/capstone/citibike/pickle_files/pickle_df_all_weekday_times.pkl')
In [44]:
# empty docks from SQL query
empty_docks_df = pd.read_csv('/Users/SilviaRuiz/dsi_tasks/capstone/citibike/empty_docks_v2.csv')
In [5]:
# empty docks with average min, median, and max times for empty times
empty_docks_df_v3 = pd.read_csv('/Users/SilviaRuiz/dsi_tasks/capstone/citibike/empty_docks_v3.csv')
In [46]:
# full docks with average min, median, and max times for empty times
full_docks_df = pd.read_csv('/Users/SilviaRuiz/dsi_tasks/capstone/citibike/full_docks.csv')

4. Learning the System (overall)

4.1 Docks Location

Showing the locations of docks, color-coded by their average utilization on Monday, July 3rd. At first glance, docks around Central Park tend to be empty most often, while docks in Lower Manhattan and Brooklyn tend to have more bikes.

In [2]:
# r script

'''
# 1. Install needed packages

# 2. importing libraries
library(htmltools)
library(ggmap)
library(gganimate)
library(gapminder)
library(sp)
library(leaflet)
library(dplyr)
library(tidyverse) # what lets you import csv

# 3. first plotting with July 3rd data

  # 3.1 creating dataframe
df_docks <- read_csv("/Users/SilviaRuiz/dsi_tasks/capstone/citibike/df_2017julythird.csv",
                  col_types = cols(
                                    line = col_integer(),
                                    dock_id = col_double(),
                                    dock_name = col_character(),
                                    date = col_character(),
                                    hour = col_integer(),
                                    minute = col_integer(),
                                    pm = col_integer(),
                                    avail_bikes = col_integer(),
                                    avail_docks = col_integer(),
                                    tot_docks = col_integer(),
                                    lat = col_double(),
                                    long = col_double(),
                                    in_service = col_integer(),
                                    status_key = col_integer(),
                                    utilization = col_double(),
                                    mil_test = col_double(),
                                    mil_min_test = col_double(),
                                    year = col_integer(),
                                    month = col_integer(),
                                    day = col_integer(),
                                    day_of_week = col_character(),
                                    date_time = col_double() 
                  ))
head(df_docks)

  # 3.2 building spatial df and setting up colors
spatial_df <- df_docks
coordinates(spatial_df) <- ~ long + lat

pal <- colorNumeric("RdGy", domain = NULL, reverse = TRUE)

  # 3.3 creating base map, and passing in df
leaflet(spatial_df) %>%
  addTiles() %>%
  setView(-74.00, 40.71, zoom = 12) %>%
  addProviderTiles("CartoDB.Positron") %>%
  addCircleMarkers(data = spatial_df,
                   radius = ~ntile(utilization,5),
                   color = ~pal(utilization),
                   label = ~dock_name,
                   stroke = FALSE, fillOpacity = 0.5)
'''

IFrame('docks_location_v2.html', width=900, height=450)
Out[2]:

This shows all CitiBike dock locations and their stacked utilization rates during the day: grey means emptier, red means fuller, and white means ~50% utilization.

4.2 Dock Utilization

In [6]:
plt.figure(figsize=(15,5))
df_all[df_all['in_service'] == 1].utilization.hist(bins=100)
plt.show()

It seems docks are empty far more often than they are full, even after excluding docks that are not in service.

4.3 Bike Availability

In [25]:
plt.figure(figsize=(20,7))
df_all[(df_all['in_service']==1) & (df_all['mil_test'].between(7, 19))].groupby('date')['avail_bikes'].mean().plot()
plt.title('Average # of Available Bikes (7am - 7pm)', fontsize= 30)
plt.ylabel('Avg. Bikes')
plt.xlabel('Date')
Out[25]:
Text(0.5,0,'Date')

On average there are between 9 and 13 bikes available at each dock across the system.

4.4 Bike Capacity

This changes because docks go in/out of service

In [14]:
ax = sns.violinplot(x="in_service", 
                    y="tot_docks", #hue="commuting_flag",
                    data=df_all, 
                    palette="Set2", 
                    #split=True,
                    scale="count",
                    inner="quartile")

Most docks have a capacity of 20-40 bikes. A few docks that are "in service" show very little capacity. A little exploration (below) reveals that their true capacity is larger than the low readings seen here, except for two docks whose maximum capacity was only 3. These docks should be ignored.

In [83]:
'''
Exploring docks with little capacity
'''

# create base df with docks that showed < 10 bikes of capacity
small_capacity = df_all[(df_all['tot_docks'] < 10) & (df_all['in_service'] == 1)].groupby(['dock_id', 'dock_name', 'tot_docks']).size()
small_capacity_df = small_capacity.to_frame().reset_index()

# create list to pass to larger df
small_capacity_docks = list(small_capacity_df.dock_id)

# get the min and max capacity of each dock
min_small_cap = df_all[df_all['dock_id'].isin(small_capacity_docks)].groupby('dock_id')['tot_docks'].min()
max_small_cap = df_all[df_all['dock_id'].isin(small_capacity_docks)].groupby('dock_id')['tot_docks'].max()

# convert to df
min_small_cap_df = min_small_cap.to_frame().reset_index()
max_small_cap_df = max_small_cap.to_frame().reset_index()

min_small_cap_df.rename(index=str, columns={"tot_docks": "min_capacity"}, inplace = True)
max_small_cap_df.rename(index=str, columns={"tot_docks": "max_capacity"}, inplace = True)

# join dataframes to display min and max for each dock

result_capacity = pd.merge(min_small_cap_df,
                           max_small_cap_df,
                           on='dock_id',
                           how='left', sort=False)

result_capacity
Out[83]:
dock_id min_capacity max_capacity
0 174 2 30
1 236 0 39
2 239 5 31
3 282 8 27
4 297 1 27
5 307 3 31
6 309 8 41
7 324 1 52
8 334 5 31
9 421 3 19
10 423 0 39
11 434 0 27
12 459 8 49
13 472 0 41
14 499 0 36
15 507 2 47
16 526 4 39
17 534 2 31
18 3002 0 42
19 3016 3 47
20 3068 1 24
21 3086 3 25
22 3101 4 27
23 3132 0 35
24 3145 3 45
25 3147 3 39
26 3163 2 43
27 3171 4 39
28 3197 0 30
29 3200 0 18
30 3235 0 31
31 3253 0 3
32 3257 3 3
33 3338 2 31
34 3348 8 19
35 3356 1 21
36 3362 0 34
37 3376 4 39
38 3377 6 25
39 3434 2 35
40 3454 9 24

It's important, then, to always take a dock's maximum observed capacity when calculating utilization. We can look at the distribution of minimum vs. maximum dock capacities:
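A sketch of what "always take the max capacity" could look like, using groupby(...).transform so each row is divided by its dock's maximum observed capacity (toy data; the real df_all columns are assumed):

```python
import pandas as pd

df = pd.DataFrame({
    'dock_id':     [174, 174, 174],
    'avail_bikes': [3,   15,  30],
    'tot_docks':   [3,   30,  30],   # the 3 is a bad capacity reading
})

# per-dock maximum capacity, broadcast back to every row
max_capacity = df.groupby('dock_id')['tot_docks'].transform('max')
df['utilization_v2'] = df['avail_bikes'] / max_capacity

print(df['utilization_v2'])
```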

In [4]:
plt.title("min vs. max dock capacity")
sns.distplot(df_all[df_all['in_service'] == 1].groupby('dock_id')['tot_docks'].max(), label='max')
sns.distplot(df_all[df_all['in_service'] == 1].groupby('dock_id')['tot_docks'].min(), label='min')
plt.legend()
Out[4]:
<matplotlib.legend.Legend at 0x1a3bb11cc0>

There are several docks that report a capacity of zero rather than their actual capacity, even when they're labeled as "in service" docks.

In [7]:
true_capacity = df_all.groupby('dock_id')['tot_docks'].max()
true_capacity.hist(bins=30)
Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a2c830240>

Docks vary in size, but most hold between 20 and 40 bikes.

5. Visualization of Utilization for different time intervals

Extract of 2017-07-03 (Monday), showing the utilization of each dock around morning and evening commuting times. Using CartoDB for plotting, then R scripts for animation.

Color Palette & Meaning

COLOR       UTILIZATION                       CRITICAL POINT
dark gray   Utilization < 10%                 CRITICAL POINT at 0%
light gray  Utilization between 10% and 30%
yellow      Utilization between 30% and 40%
green       Utilization between 40% and 60%   NONE - OPTIMAL
yellow      Utilization between 60% and 70%
orange      Utilization between 70% and 90%
red         Utilization > 90%                 CRITICAL POINT at 100%
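The palette above, written as a lookup function (bucket edges as listed; how exact boundary values like 10% are assigned is an assumption):

```python
def utilization_color(u):
    """Map a utilization fraction (0-1) to the palette above."""
    if u < 0.10: return 'dark gray'   # critical point at 0%
    if u < 0.30: return 'light gray'
    if u < 0.40: return 'yellow'
    if u < 0.60: return 'green'       # optimal
    if u < 0.70: return 'yellow'
    if u < 0.90: return 'orange'
    return 'red'                      # critical point at 100%

print(utilization_color(0.0), utilization_color(0.5), utilization_color(1.0))
```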

Description: Morning commuting-time utilization rates across all docks
Insights

  • Areas that become
    • fuller during
      • morning commuting times: Financial District, Lower East, Central Midtown
      • evening commuting times: Brooklyn, Financial District
    • emptier during
      • morning commuting times: Upper East, around Lincoln Tunnel
      • evening commuting times: Central Midtown
In [19]:
listOfImageNames = ['/Users/SilviaRuiz/dsi_tasks/capstone/citibike/cartoDB_outputs/Carto7-8am.png',
                    '/Users/SilviaRuiz/dsi_tasks/capstone/citibike/cartoDB_outputs/Carto9-10am.png',
                    '/Users/SilviaRuiz/dsi_tasks/capstone/citibike/cartoDB_outputs/Carto11am-12pm.png']

titles = ['Dock Utilization between 7-8am',
          'Dock Utilization between 9-10am']

for (title,imageName) in zip(titles,listOfImageNames):
    display(title, Image(filename=imageName, width=500, height=500))
'Dock Utilization between 7-8am'
'Dock Utilization between 9-10am'

Afternoon commuting times

In [ ]:
listOfImageNames = ['/Users/SilviaRuiz/dsi_tasks/capstone/citibike/cartoDB_outputs/Carto15-16pm.png',
                    '/Users/SilviaRuiz/dsi_tasks/capstone/citibike/cartoDB_outputs/Carto17-18pm.png']

titles = ['Dock Utilization between 3-4pm',
          'Dock Utilization between 5-6pm']

for (title,imageName) in zip(titles,listOfImageNames):
    display(title, Image(filename=imageName, width=500, height=500))

6. Utilization by docks / commuting times

Identified docks at critical points over the entire time period and ranked them by how often they were empty or full. Now joining in more flag columns, and eventually plotting in R.

Empty Docks

6.1 Empty Docks: Narrowing down priority docks

The focus for empty docks is to understand which docks empty out during commuting times as riders take bikes from residential neighborhoods to work areas, and then to make sure those docks are better rebalanced overnight so they offer more bikes in the morning.
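The morning-commute ranking done in SQL below can also be sketched in pandas (toy rows; the real df_all columns are assumed):

```python
import pandas as pd

df = pd.DataFrame({
    'dock_id':     [400, 400, 400, 359, 359],
    'mil_test':    [8,   9,   15,  8,   9],
    'in_service':  [1,   1,   1,   1,   1],
    'utilization': [0.0, 0.0, 0.5, 0.0, 0.3],
})

# keep in-service readings between 7am and 10am
morning = df[df['mil_test'].between(7, 10) & (df['in_service'] == 1)]

# count, per dock, how often it sat at 0% utilization
times_empty = (morning['utilization'] == 0).groupby(morning['dock_id']).sum()

print(times_empty.sort_values(ascending=False))
```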

In [28]:
# top 30 empty docks during commuting times (7 - 10am)

'''
select dock_id,
	dock_name,
	sum(case when utilization = 1 then 1 else 0 end) as times_full,
	sum(case when utilization = 0 then 1 else 0 end) as times_empty
from raw_data.all_data
where mil_test between 7 and 10 -- 7am to 10am  
	and in_service = 1
group by 1,2
order by 3 desc
limit 30
'''
# these are the docks that are actually empty (note: "order by 3" above ranks by times_full; use "order by 4 desc" to rank by times_empty)
empty_docks = [3337,3395,3340,3352,3394,3344,3326,3436,3342,456,3184,3392,3381,3373,337,3330,3399,351,3391,3393,260,3348,3466,3233,418,534,2005,2023,430,2001]

The next step is to show when these docks are empty. We want to capture docks that ebb and flow between stages of utilization; i.e., we don't care to observe docks that are underutilized and therefore usually empty.

In [9]:
# getting average min, median, and max time of utilization at 0%
'''
select date,
            --utilization,
            orig.dock_id,
			count(utilization) as times_empty_in_day,
            min(mil_min_test) as min_time,
            percentile_disc(0.5) within group (order by mil_min_test) as med_time,
            max(mil_min_test) as max_time
      from raw_data.all_data as orig
      inner join

			-- 1. Docks that have been empty a lot during the day in the full time period
			(select dock_id,
					sum(case when utilization = 1 then 1 else 0 end) as times_full,
					sum(case when utilization = 0 then 1 else 0 end) as times_empty
			from raw_data.all_data
			where mil_test between 7 and 22 -- the full day (7am-10pm)
					and in_service = 1 -- only docks in service
			group by 1
			order by 3 desc -- sort by docks that are most empty in the time period (change to 2 desc for "full")
			limit 30 -- only take the top 30
			) as empty on orig.dock_id = empty.dock_id

	  where utilization = 0 -- change to "1" for "full"
      group by 1,2
    ) as a
	left join raw_data.all_data x on a.dock_id = x.dock_id
group by 1,2,3,4
'''

empty_docks_df = pd.read_csv('/Users/SilviaRuiz/dsi_tasks/capstone/citibike/empty_docks_v2.csv')
In [7]:
# Initialize the figure
plt.figure(figsize=(20,7))
sns.despine(bottom=True, left=True)

# Graph the min, med, and max for all the "empty docks" times
sns.pointplot(x="dock_id", y="Total", hue="Values",
              data=empty_docks_df_v3, dodge=.532, join=False, palette="dark",
              markers="d")

plt.ylabel('Hour')
plt.title('Average min/median/max times of docks at 0% utilization')
Out[7]:
Text(0.5,1,'Average min/median/max times of docks at 0% utilization')

We can see from the above that some docks are empty over a wide range of times. We want to focus on the ones with a smaller range, preferably falling within commuting times. Thus, we calculate this range and prioritize the docks with the smallest ranges.

In [8]:
# since this is in melted format, we need a different format
empty_docks_df_v3.head()
Out[8]:
dock_id Values Total
0 359 Sum of avg_min_time 15.898601
1 359 Sum of avg_med_time 19.326209
2 359 Sum of avg_max_time 21.738804
3 400 Sum of avg_min_time 10.733476
4 400 Sum of avg_med_time 12.665812
In [10]:
# this format will allow us to calculate the range for each of these docks
empty_docks_df.head()
Out[10]:
dock_id dock_name _lat _long days_empty avg_min_time avg_med_time avg_max_time range_of_hours_atcritical
0 359 E 47 St & Park Ave 40.755103 -73.974987 138 6.884662 17.484058 23.689734 16.805072
1 400 Pitt St & Stanton St 40.719261 -73.981780 117 10.733476 12.665812 15.618519 4.885043
2 423 W 54 St & 9 Ave 40.765849 -73.986905 141 6.115366 11.388180 21.665130 15.549764
3 447 8 Ave & W 52 St 40.763707 -73.985162 139 6.352998 12.198201 21.512710 15.159712
4 449 W 52 St & 9 Ave 40.764618 -73.987895 138 5.890097 10.956159 20.549155 14.659058
In [11]:
# calculating range and displaying it

empty_docks_df['range'] = empty_docks_df['avg_max_time'] - empty_docks_df['avg_min_time']
empty_docks_df.head()
Out[11]:
dock_id dock_name _lat _long days_empty avg_min_time avg_med_time avg_max_time range_of_hours_atcritical range
0 359 E 47 St & Park Ave 40.755103 -73.974987 138 6.884662 17.484058 23.689734 16.805072 16.805072
1 400 Pitt St & Stanton St 40.719261 -73.981780 117 10.733476 12.665812 15.618519 4.885043 4.885043
2 423 W 54 St & 9 Ave 40.765849 -73.986905 141 6.115366 11.388180 21.665130 15.549764 15.549764
3 447 8 Ave & W 52 St 40.763707 -73.985162 139 6.352998 12.198201 21.512710 15.159712 15.159712
4 449 W 52 St & 9 Ave 40.764618 -73.987895 138 5.890097 10.956159 20.549155 14.659058 14.659058
In [19]:
# sorting by those with the smallest range and creating a priority list based on the top 10
empty_priority_docks_df = empty_docks_df.sort_values('range', ascending = True).head(10)
empty_priority_docks_df
Out[19]:
dock_id dock_name _lat _long days_empty avg_min_time avg_med_time avg_max_time range_of_hours_atcritical range
1 400 Pitt St & Stanton St 40.719261 -73.981780 117 10.733476 12.665812 15.618519 4.885043 4.885043
17 3106 Driggs Ave & N Henry St 40.723250 -73.943080 128 10.462109 12.556641 15.920312 5.458203 5.458203
16 3050 Putnam Ave & Throop Ave 40.685153 -73.941110 121 11.907713 14.177548 17.621625 5.713912 5.713912
24 3289 W 90 St & Amsterdam Ave 40.790179 -73.972889 123 10.124119 12.430352 16.055149 5.931030 5.931030
26 3343 W 107 St & Columbus Ave 40.799757 -73.962113 125 10.635200 13.203867 17.408400 6.773200 6.773200
29 3366 West End Ave & W 107 St 40.802117 -73.968181 121 10.044215 12.879752 16.822176 6.777961 6.777961
25 3302 Columbus Ave & W 103 St 40.796935 -73.964341 122 9.393716 13.617486 20.042213 10.648497 10.648497
20 3144 E 81 St & Park Ave 40.776777 -73.959010 119 7.540756 12.164566 20.039636 12.498880 12.498880
30 3375 3 Ave & E 72 St 40.769943 -73.960607 140 6.556429 11.993214 20.935000 14.378571 14.378571
15 2023 E 55 St & Lexington Ave 40.759681 -73.970314 139 8.744005 16.762230 23.243765 14.499760 14.499760

Plotting these docks on different days will help us understand their dynamics and when they go empty.

In [20]:
empty_priority_docks_list = list(empty_priority_docks_df.dock_id)
empty_priority_docks_list
Out[20]:
[400, 3106, 3050, 3289, 3343, 3366, 3302, 3144, 3375, 2023]
In [21]:
# defining function to graph them

def plot_utilization(date, dock_id):
    # graphing utilization over time 
    xs = list(df_all[(df_all['dock_id']==dock_id) & (df_all['date']==date)].mil_min_test)
    ys = list(df_all[(df_all['dock_id']==dock_id) & (df_all['date']==date)].utilization)

    xs, ys = zip(*sorted(zip(xs, ys))) # without sorting, it goes haywire

    # getting the address of the dock we're graphing
    address = str(df_all[df_all['dock_id']==dock_id].dock_name.drop_duplicates())
    dock_address = address[address.find(' '):address.find('\n')].strip()

    # formatting the graph
    plt.plot(xs, ys, label='%s' %dock_address)
    plt.xlabel("Hours in the Day", fontsize=16)
    plt.ylabel("Utilization %", fontsize=16)
    plt.axis([1, 24, 0.0, 1.0])
    plt.xticks(np.arange(0,26,1))
    plt.title('Dock ID = %i' %dock_id + ' Address = %s' %dock_address)
In [26]:
# splitting into two plots for readability, and choosing a random Monday

# plot 1
plt.figure(figsize=(15,8))
plt.subplot(211)

for dock in empty_priority_docks_list[0:5]:
    plot_utilization('2017-07-17',dock)

plt.legend(loc='upper left')
plt.show()

# plot 2
plt.figure(figsize=(15,8))
plt.subplot(212)

for dock in empty_priority_docks_list[5:11]:
    plot_utilization('2017-07-17',dock)

plt.legend(loc='upper left')
plt.show()

From these graphs, it looks like we care about the following docks:

  • Pitt & Stanton St
  • Driggs Ave & N Henry St
  • E 55 St & Lexington Ave
  • Putnam Ave & Throop Ave (maybe)
  • W 107 St & Columbus Ave (maybe)

Worth graphing another day just to see how it changes

In [27]:
# splitting into two plots for readability, and choosing a random weekday (other than July 17th)

# plot 1
plt.figure(figsize=(15,8))
plt.subplot(211)

for dock in empty_priority_docks_list[0:5]:
    plot_utilization('2017-06-13',dock)

plt.legend(loc='upper left')
plt.show()

# plot 2
plt.figure(figsize=(15,8))
plt.subplot(212)

for dock in empty_priority_docks_list[5:11]:
    plot_utilization('2017-06-13',dock)

plt.legend(loc='upper left')
plt.show()

# maybe it makes sense to plot a distribution for each dock, but I can do that in a later version of this

So it looks like we should focus on all the docks from above, plus the following from the second graph:

  • E 55 St & Lexington Ave
  • West End Ave & W 107 St

6.2 Create df with priority empty docks and graph

(i) Creating df

  • narrowing list to priority empty docks
  • graphing one day (to make the map cleaner)
In [31]:
df_all_empty_priority = df_all[(df_all['dock_id'].isin(empty_priority_docks_list)) & (df_all['date'] == '2017-06-13')] 
df_all_empty_priority
Out[31]:
dock_id dock_name date hour minute pm avail_bikes avail_docks tot_docks _lat _long in_service status_key utilization mil_test mil_min_test year month day day_of_week
396785 400 Pitt St & Stanton St 2017-06-13 1 37 0 14 1 15 40.719261 -73.981780 1 1 0.933333 1 1.616667 2017 6 13 Tuesday
396786 400 Pitt St & Stanton St 2017-06-13 2 42 0 15 0 15 40.719261 -73.981780 1 1 1.000000 2 2.700000 2017 6 13 Tuesday
396787 400 Pitt St & Stanton St 2017-06-13 3 51 0 15 0 15 40.719261 -73.981780 1 1 1.000000 3 3.850000 2017 6 13 Tuesday
396788 400 Pitt St & Stanton St 2017-06-13 4 55 0 15 0 15 40.719261 -73.981780 1 1 1.000000 4 4.916667 2017 6 13 Tuesday
396789 400 Pitt St & Stanton St 2017-06-13 6 0 0 15 0 15 40.719261 -73.981780 1 1 1.000000 6 6.000000 2017 6 13 Tuesday
396790 400 Pitt St & Stanton St 2017-06-13 7 9 0 14 1 15 40.719261 -73.981780 1 1 0.933333 7 7.150000 2017 6 13 Tuesday
396791 400 Pitt St & Stanton St 2017-06-13 8 24 0 12 3 15 40.719261 -73.981780 1 1 0.800000 8 8.400000 2017 6 13 Tuesday
396792 400 Pitt St & Stanton St 2017-06-13 9 43 0 0 15 15 40.719261 -73.981780 1 1 0.000000 9 9.716667 2017 6 13 Tuesday
396793 400 Pitt St & Stanton St 2017-06-13 10 54 0 0 15 15 40.719261 -73.981780 1 1 0.000000 10 10.900000 2017 6 13 Tuesday
396794 400 Pitt St & Stanton St 2017-06-13 12 30 0 15 0 15 40.719261 -73.981780 1 1 1.000000 24 24.500000 2017 6 13 Tuesday
396795 400 Pitt St & Stanton St 2017-06-13 1 18 1 1 14 15 40.719261 -73.981780 1 1 0.066667 13 13.300000 2017 6 13 Tuesday
396796 400 Pitt St & Stanton St 2017-06-13 2 32 1 0 15 15 40.719261 -73.981780 1 1 0.000000 14 14.533333 2017 6 13 Tuesday
396797 400 Pitt St & Stanton St 2017-06-13 3 48 1 1 14 15 40.719261 -73.981780 1 1 0.066667 15 15.800000 2017 6 13 Tuesday
396798 400 Pitt St & Stanton St 2017-06-13 5 2 1 0 15 15 40.719261 -73.981780 1 1 0.000000 17 17.033333 2017 6 13 Tuesday
396799 400 Pitt St & Stanton St 2017-06-13 6 17 1 5 10 15 40.719261 -73.981780 1 1 0.333333 18 18.283333 2017 6 13 Tuesday
396800 400 Pitt St & Stanton St 2017-06-13 7 37 1 13 2 15 40.719261 -73.981780 1 1 0.866667 19 19.616667 2017 6 13 Tuesday
396801 400 Pitt St & Stanton St 2017-06-13 8 48 1 15 0 15 40.719261 -73.981780 1 1 1.000000 20 20.800000 2017 6 13 Tuesday
396802 400 Pitt St & Stanton St 2017-06-13 9 49 1 15 0 15 40.719261 -73.981780 1 1 1.000000 21 21.816667 2017 6 13 Tuesday
396803 400 Pitt St & Stanton St 2017-06-13 10 52 1 15 0 15 40.719261 -73.981780 1 1 1.000000 22 22.866667 2017 6 13 Tuesday
396804 400 Pitt St & Stanton St 2017-06-13 12 5 1 0 15 15 40.719261 -73.981780 1 1 0.000000 12 12.083333 2017 6 13 Tuesday
468707 2023 E 55 St & Lexington Ave 2017-06-13 1 37 0 0 35 36 40.759681 -73.970314 1 1 0.000000 1 1.616667 2017 6 13 Tuesday
468708 2023 E 55 St & Lexington Ave 2017-06-13 2 42 0 1 35 36 40.759681 -73.970314 1 1 0.027778 2 2.700000 2017 6 13 Tuesday
468709 2023 E 55 St & Lexington Ave 2017-06-13 3 51 0 2 34 36 40.759681 -73.970314 1 1 0.055556 3 3.850000 2017 6 13 Tuesday
468710 2023 E 55 St & Lexington Ave 2017-06-13 4 55 0 2 34 36 40.759681 -73.970314 1 1 0.055556 4 4.916667 2017 6 13 Tuesday
468711 2023 E 55 St & Lexington Ave 2017-06-13 6 0 0 1 35 36 40.759681 -73.970314 1 1 0.027778 6 6.000000 2017 6 13 Tuesday
468712 2023 E 55 St & Lexington Ave 2017-06-13 7 9 0 14 22 36 40.759681 -73.970314 1 1 0.388889 7 7.150000 2017 6 13 Tuesday
468713 2023 E 55 St & Lexington Ave 2017-06-13 8 24 0 32 4 36 40.759681 -73.970314 1 1 0.888889 8 8.400000 2017 6 13 Tuesday
468714 2023 E 55 St & Lexington Ave 2017-06-13 9 43 0 35 1 36 40.759681 -73.970314 1 1 0.972222 9 9.716667 2017 6 13 Tuesday
468715 2023 E 55 St & Lexington Ave 2017-06-13 10 54 0 30 6 36 40.759681 -73.970314 1 1 0.833333 10 10.900000 2017 6 13 Tuesday
468716 2023 E 55 St & Lexington Ave 2017-06-13 12 30 0 1 34 36 40.759681 -73.970314 1 1 0.027778 24 24.500000 2017 6 13 Tuesday
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
616621 3366 West End Ave & W 107 St 2017-06-13 1 18 1 0 31 31 40.802117 -73.968181 1 1 0.000000 13 13.300000 2017 6 13 Tuesday
616622 3366 West End Ave & W 107 St 2017-06-13 2 32 1 1 30 31 40.802117 -73.968181 1 1 0.032258 14 14.533333 2017 6 13 Tuesday
616623 3366 West End Ave & W 107 St 2017-06-13 3 48 1 1 30 31 40.802117 -73.968181 1 1 0.032258 15 15.800000 2017 6 13 Tuesday
616624 3366 West End Ave & W 107 St 2017-06-13 5 2 1 2 29 31 40.802117 -73.968181 1 1 0.064516 17 17.033333 2017 6 13 Tuesday
616625 3366 West End Ave & W 107 St 2017-06-13 6 17 1 4 27 31 40.802117 -73.968181 1 1 0.129032 18 18.283333 2017 6 13 Tuesday
616626 3366 West End Ave & W 107 St 2017-06-13 7 37 1 1 30 31 40.802117 -73.968181 1 1 0.032258 19 19.616667 2017 6 13 Tuesday
616627 3366 West End Ave & W 107 St 2017-06-13 8 48 1 9 22 31 40.802117 -73.968181 1 1 0.290323 20 20.800000 2017 6 13 Tuesday
616628 3366 West End Ave & W 107 St 2017-06-13 9 49 1 9 22 31 40.802117 -73.968181 1 1 0.290323 21 21.816667 2017 6 13 Tuesday
616629 3366 West End Ave & W 107 St 2017-06-13 10 52 1 12 19 31 40.802117 -73.968181 1 1 0.387097 22 22.866667 2017 6 13 Tuesday
616630 3366 West End Ave & W 107 St 2017-06-13 12 5 1 0 31 31 40.802117 -73.968181 1 1 0.000000 12 12.083333 2017 6 13 Tuesday
621336 3375 3 Ave & E 72 St 2017-06-13 1 37 0 3 32 35 40.769943 -73.960607 1 1 0.085714 1 1.616667 2017 6 13 Tuesday
621337 3375 3 Ave & E 72 St 2017-06-13 2 42 0 3 32 35 40.769943 -73.960607 1 1 0.085714 2 2.700000 2017 6 13 Tuesday
621338 3375 3 Ave & E 72 St 2017-06-13 3 51 0 3 32 35 40.769943 -73.960607 1 1 0.085714 3 3.850000 2017 6 13 Tuesday
621339 3375 3 Ave & E 72 St 2017-06-13 4 55 0 3 32 35 40.769943 -73.960607 1 1 0.085714 4 4.916667 2017 6 13 Tuesday
621340 3375 3 Ave & E 72 St 2017-06-13 6 0 0 1 34 35 40.769943 -73.960607 1 1 0.028571 6 6.000000 2017 6 13 Tuesday
621341 3375 3 Ave & E 72 St 2017-06-13 7 9 0 1 34 35 40.769943 -73.960607 1 1 0.028571 7 7.150000 2017 6 13 Tuesday
621342 3375 3 Ave & E 72 St 2017-06-13 8 24 0 0 35 35 40.769943 -73.960607 1 1 0.000000 8 8.400000 2017 6 13 Tuesday
621343 3375 3 Ave & E 72 St 2017-06-13 9 43 0 0 35 35 40.769943 -73.960607 1 1 0.000000 9 9.716667 2017 6 13 Tuesday
621344 3375 3 Ave & E 72 St 2017-06-13 10 54 0 2 33 35 40.769943 -73.960607 1 1 0.057143 10 10.900000 2017 6 13 Tuesday
621345 3375 3 Ave & E 72 St 2017-06-13 12 30 0 3 32 35 40.769943 -73.960607 1 1 0.085714 24 24.500000 2017 6 13 Tuesday
621346 3375 3 Ave & E 72 St 2017-06-13 1 18 1 1 34 35 40.769943 -73.960607 1 1 0.028571 13 13.300000 2017 6 13 Tuesday
621347 3375 3 Ave & E 72 St 2017-06-13 2 32 1 0 35 35 40.769943 -73.960607 1 1 0.000000 14 14.533333 2017 6 13 Tuesday
621348 3375 3 Ave & E 72 St 2017-06-13 3 48 1 2 33 35 40.769943 -73.960607 1 1 0.057143 15 15.800000 2017 6 13 Tuesday
621349 3375 3 Ave & E 72 St 2017-06-13 5 2 1 0 35 35 40.769943 -73.960607 1 1 0.000000 17 17.033333 2017 6 13 Tuesday
621350 3375 3 Ave & E 72 St 2017-06-13 6 17 1 0 35 35 40.769943 -73.960607 1 1 0.000000 18 18.283333 2017 6 13 Tuesday
621351 3375 3 Ave & E 72 St 2017-06-13 7 37 1 0 35 35 40.769943 -73.960607 1 1 0.000000 19 19.616667 2017 6 13 Tuesday
621352 3375 3 Ave & E 72 St 2017-06-13 8 48 1 0 35 35 40.769943 -73.960607 1 1 0.000000 20 20.800000 2017 6 13 Tuesday
621353 3375 3 Ave & E 72 St 2017-06-13 9 49 1 2 33 35 40.769943 -73.960607 1 1 0.057143 21 21.816667 2017 6 13 Tuesday
621354 3375 3 Ave & E 72 St 2017-06-13 10 52 1 1 34 35 40.769943 -73.960607 1 1 0.028571 22 22.866667 2017 6 13 Tuesday
621355 3375 3 Ave & E 72 St 2017-06-13 12 5 1 0 35 35 40.769943 -73.960607 1 1 0.000000 12 12.083333 2017 6 13 Tuesday

200 rows × 20 columns

In [32]:
# exporting to csv and saving as pickle file
df_all_empty_priority.to_csv(path_or_buf='/Users/SilviaRuiz/dsi_tasks/capstone/citibike/df_all_empty_priority.csv')
df_all_empty_priority.to_pickle('/Users/SilviaRuiz/dsi_tasks/capstone/citibike/pickle_files/pickle_df_all_empty_priority.pkl')

(ii) Writing the R script

In [ ]:
# r script
# will have to figure out how to animate with gganimate in a later version

'''
# 2. importing libraries
library(htmltools)
library(ggmap)
library(gganimate)
library(gapminder)
library(sp)
library(leaflet)
library(dplyr)
library(tidyverse) # what lets you import csv

# 3. plotting of empty docks on Tuesday, June 13th

# 3.1 creating dataframe
df_all_empty_priority <- read_csv("/Users/SilviaRuiz/dsi_tasks/capstone/citibike/df_all_empty_priority.csv",
                                  col_types = cols(
                                    line = col_integer(), # had to manually insert "line"
                                    dock_id = col_double(),
                                    dock_name = col_character(),
                                    date = col_character(),
                                    hour = col_integer(),
                                    minute = col_integer(),
                                    pm = col_integer(),
                                    avail_bikes = col_integer(),
                                    avail_docks = col_integer(),
                                    tot_docks = col_integer(),
                                    lat = col_double(), # had to manually change name from _lat
                                    long = col_double(), # had to manually change name from _long
                                    in_service = col_integer(),
                                    status_key = col_integer(),
                                    utilization = col_double(),
                                    mil_test = col_double(),
                                    mil_min_test = col_double(),
                                    year = col_integer(),
                                    month = col_integer(),
                                    day = col_integer(),
                                    day_of_week = col_character()
                                  ))
head(df_all_empty_priority)

# 3.2 building spatial df and setting up colors
spatial_df_empty <- df_all_empty_priority
coordinates(spatial_df_empty) <- ~ long + lat

pal <- colorNumeric("RdGy", domain = NULL, reverse = TRUE)

# 3.3 creating base map, and passing in df
leaflet(spatial_df_empty) %>%
  addTiles() %>%
  setView(-74.00, 40.71, zoom = 12) %>%
  addProviderTiles("Thunderforest.OpenCycleMap") %>%
  addCircleMarkers(data = spatial_df_empty,
                   radius = ~ntile(utilization,7),
                   color = ~pal(utilization),
                   label = ~dock_name,
                   #labelOptions = labelOptions(noHide = T),
                   stroke = FALSE, fillOpacity = 0.5)
'''

(iii) Graphing the docks on the map and seeing where they are in the city

In [3]:
# file saved in the same directory because notebook security restrictions prevent access to folders above this one
# next version: use a terrain basemap so I can see what's around these docks

IFrame('empty_priority_docks_map_finalv.html', width=900, height=450)
Out[3]:

Full Docks

6.3 Narrowing down priority Full Docks

Similarly to the empty-docks exploration, we want to explore docks that become full during certain times. Some of these may overlap with the docks above. For example, E 55 St & Lexington Ave is almost at capacity during work hours and empties out afterwards. The same is true of Pitt St & Stanton St and Driggs Ave & N Henry St.

In [47]:
# getting average min, median, and max time of utilization at 100%
'''
select
      a.dock_id,
	  x.dock_name,
	  x._lat as lat,
	  x._long as long,
	  count(distinct a.date) as days_full,
      avg(min_time) as avg_min_time,
      avg(med_time) as avg_med_time,
      avg(max_time) as avg_max_time,
	  avg(max_time) - avg(min_time) as range_of_hours_atcritical
from (
			select date,
            orig.dock_id,
			count(utilization) as times_full_in_day,
            min(mil_min_test) as min_time,
            percentile_disc(0.5) within group (order by mil_min_test) as med_time,
            max(mil_min_test) as max_time
      from raw_data.all_data as orig
      inner join

			-- 1. Docks that have been full a lot during the day in the full time period
			(select dock_id,
					sum(case when utilization = 1 then 1 else 0 end) as times_full,
					sum(case when utilization = 0 then 1 else 0 end) as times_empty
			from raw_data.all_data
			where mil_test between 7 and 22 -- the full day (7am-10pm)
					and in_service = 1 -- only docks in service
			group by 1
			order by 2 desc -- sort by docks that are most full in the time period
			limit 30 -- only take the top 30
			) as full_docks on orig.dock_id = full_docks.dock_id

	  where utilization = 1 -- utilization = 1 means "full" (use 0 for "empty")
      group by 1,2
    ) as a
	left join raw_data.all_data x on a.dock_id = x.dock_id
group by 1,2,3,4
'''

full_docks_df = pd.read_csv('/Users/SilviaRuiz/dsi_tasks/capstone/citibike/full_docks.csv')
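The SQL above uses `percentile_disc ... within group`, which is PostgreSQL syntax and will not run through `pandasql`; it was executed against a local Postgres copy of the data and exported to CSV. For readers without that database, here is a hedged pandas sketch of the same aggregation. The toy frame and its values are invented; only the column names (`dock_id`, `date`, `utilization`, `mil_test`, `mil_min_test`, `in_service`) follow the notebook:

```python
import pandas as pd

# Hypothetical toy frame standing in for the raw dock data
df = pd.DataFrame({
    'dock_id':      [1, 1, 1, 2, 2, 2],
    'date':         ['2017-06-13'] * 6,
    'utilization':  [1.0, 1.0, 0.5, 1.0, 0.2, 0.3],
    'mil_min_test': [8.0, 12.5, 18.0, 9.0, 11.0, 15.0],
    'mil_test':     [8, 12, 18, 9, 11, 15],
    'in_service':   [1] * 6,
})

# 1. Docks most often full during the 7am-10pm window (the SQL's inner subquery)
day = df[(df['mil_test'].between(7, 22)) & (df['in_service'] == 1)]
times_full = (day.assign(full=(day['utilization'] == 1).astype(int))
                 .groupby('dock_id')['full'].sum()
                 .sort_values(ascending=False))
top_docks = times_full.head(30).index

# 2. Per dock and day, min/median/max time of day at 100% utilization,
#    then average those across days (the SQL's outer query)
full_rows = df[(df['utilization'] == 1) & (df['dock_id'].isin(top_docks))]
summary = (full_rows.groupby(['dock_id', 'date'])['mil_min_test']
                    .agg(min_time='min', med_time='median', max_time='max')
                    .groupby('dock_id').mean())
summary['range_of_hours_atcritical'] = summary['max_time'] - summary['min_time']
print(summary)
```

This mirrors the query's shape (inner ranking join, outer per-day aggregation) but is a sketch, not the exact pipeline used to produce `full_docks.csv`.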
In [48]:
full_docks_df.head()
Out[48]:
dock_id dock_name lat long days_full avg_min_time avg_med_time avg_max_time range_of_hours_atcritical
0 291 Madison St & Montgomery St 40.713126 -73.984844 78 7.666239 11.356838 22.867094 15.200855
1 337 Old Slip & Front St 40.703799 -74.008387 47 10.553546 11.582624 13.591135 3.037589
2 356 Bialystoker Pl & Delancey St 40.716226 -73.982612 65 10.011026 13.383590 21.095641 11.084615
3 393 E 5 St & Avenue C 40.722992 -73.979955 72 7.541204 10.701620 21.781713 14.240509
4 400 Pitt St & Stanton St 40.719261 -73.981780 82 7.666667 11.129878 23.124390 15.457724
In [63]:
# melt df to have min, med, max in one column

full_docks_df_melt = pd.melt(full_docks_df, id_vars=['dock_id', 'dock_name'], value_vars=['avg_min_time', 'avg_med_time', 'avg_max_time'])\
                    .sort_values(by=['dock_id'],ascending = True)

full_docks_df_melt.head()
Out[63]:
dock_id dock_name variable value
0 291 Madison St & Montgomery St avg_min_time 7.666239
60 291 Madison St & Montgomery St avg_max_time 22.867094
30 291 Madison St & Montgomery St avg_med_time 11.356838
1 337 Old Slip & Front St avg_min_time 10.553546
61 337 Old Slip & Front St avg_max_time 13.591135
In [64]:
full_docks_df_melt_v2 = full_docks_df_melt.reset_index(drop = True)

full_docks_df_melt_v2.head()
Out[64]:
dock_id dock_name variable value
0 291 Madison St & Montgomery St avg_min_time 7.666239
1 291 Madison St & Montgomery St avg_max_time 22.867094
2 291 Madison St & Montgomery St avg_med_time 11.356838
3 337 Old Slip & Front St avg_min_time 10.553546
4 337 Old Slip & Front St avg_max_time 13.591135
In [66]:
# Initialize the figure
plt.figure(figsize=(20,7))
sns.despine(bottom=True, left=True)

# Graph the min, med, and max for all the "full docks" times
sns.pointplot(x="dock_id", y="value", hue="variable",
              data=full_docks_df_melt_v2, dodge=.532, join=False, palette="dark",
              markers="d")

plt.ylabel('Hour')
plt.title('Average min/median/max times of docks at 100% utilization')
Out[66]:
Text(0.5,1,'Average min/median/max times of docks at 100% utilization')

These docks spend far longer at critical capacity than the empty docks do.
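One simple way to quantify "time at critical capacity" is the share of a dock's readings spent at 0% or 100% utilization. A minimal sketch on toy data (the dock ids are borrowed from the tables above, the utilization values are invented):

```python
import pandas as pd

# Toy observations: one utilization reading per row, as in the notebook
obs = pd.DataFrame({
    'dock_id':     [337, 337, 337, 337, 400, 400, 400, 400],
    'utilization': [1.0, 1.0, 1.0, 0.4, 0.0, 0.8, 0.5, 0.0],
})

# Share of readings at a critical point (0% or 100% utilization)
critical = obs['utilization'].isin([0.0, 1.0])
share = critical.groupby(obs['dock_id']).mean()
print(share)
```

On the real data this share, multiplied by the sampling interval, gives an estimate of minutes per day at a critical point per dock.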

In [68]:
# sorting by those with the smallest range and creating a priority list based on the top 10
full_priority_docks_df = full_docks_df.sort_values('range_of_hours_atcritical', ascending = True).head(10)
full_priority_docks_df
Out[68]:
dock_id dock_name lat long days_full avg_min_time avg_med_time avg_max_time range_of_hours_atcritical
1 337 Old Slip & Front St 40.703799 -74.008387 47 10.553546 11.582624 13.591135 3.037589
5 418 Front St & Gold St 40.702240 -73.982578 52 11.277564 12.536218 14.757051 3.479487
18 3373 3 St & 3 Ave 40.675070 -73.987752 54 10.814815 12.161111 15.387963 4.573148
28 3420 Douglass St & 3 Ave 40.680213 -73.984327 64 12.506510 14.755208 17.860937 5.354427
26 3407 Union St & Nevins St 40.679098 -73.987655 58 14.359483 16.862644 20.189368 5.829885
27 3419 Douglass St & 4 Ave 40.679279 -73.981540 61 12.430328 15.417213 19.970765 7.540437
25 3399 7 St & 3 Ave 40.672603 -73.989830 58 11.522989 14.532471 19.369540 7.846552
19 3381 3 St & Hoyt St 40.677729 -73.993641 54 10.527778 13.662346 18.858025 8.330247
9 3335 Union St & 4 Ave 40.677274 -73.982820 68 11.909559 14.766422 20.633088 8.723529
14 3347 Van Brunt St & Wolcott St 40.677343 -74.012751 57 10.590936 13.652924 21.197953 10.607018
In [70]:
full_priority_docks_list = list(full_priority_docks_df.dock_id)
full_priority_docks_list
Out[70]:
[337, 418, 3373, 3420, 3407, 3419, 3399, 3381, 3335, 3347]
In [71]:
# splitting the docks into two plots for readability, and choosing a random Monday

# Day 1

# plot 1
plt.figure(figsize=(15,8))
plt.subplot(211)

for dock in full_priority_docks_list[0:5]:
    plot_utilization('2017-07-17',dock)

plt.legend(loc='upper left')
plt.show()

# plot 2
plt.figure(figsize=(15,8))
plt.subplot(212)

for dock in full_priority_docks_list[5:10]:
    plot_utilization('2017-07-17',dock)

plt.legend(loc='upper left')
plt.show()
In [72]:
# Day 2

# plot 1
plt.figure(figsize=(15,8))
plt.subplot(211)

for dock in full_priority_docks_list[0:5]:
    plot_utilization('2017-06-13',dock)

plt.legend(loc='upper left')
plt.show()

# plot 2
plt.figure(figsize=(15,8))
plt.subplot(212)

for dock in full_priority_docks_list[5:10]:
    plot_utilization('2017-06-13',dock)

plt.legend(loc='upper left')
plt.show()
In [74]:
# Day 3

# plot 1
plt.figure(figsize=(15,8))
plt.subplot(211)

for dock in full_priority_docks_list[0:5]:
    plot_utilization('2017-08-10',dock)

plt.legend(loc='upper left')
plt.show()

# plot 2
plt.figure(figsize=(15,8))
plt.subplot(212)

for dock in full_priority_docks_list[5:10]:
    plot_utilization('2017-08-10',dock)

plt.legend(loc='upper left')
plt.show()

These docks don't vary much during the day. A couple of them do seem to follow commute patterns:

  • Old Slip & Front St
  • Front St & Gold St

The rest are full for much of the day, and these might benefit from simply having more capacity [will check how big they are now]

6.3cont'd: Can we fix "full docks" just by adding more capacity? How big are they now?

Another thing to consider on the next version: what is the utilization of the docks nearby?
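Checking nearby docks would require a distance measure over the `_lat`/`_long` columns. A sketch using the haversine formula, with three real dock coordinates taken from the tables above and an illustrative 500 m radius (these particular docks happen to be kilometres apart, so the toy query finds no neighbors):

```python
import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between (lat, long) points."""
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

# Coordinates from the full_docks_df table above
docks = pd.DataFrame({
    'dock_id': [337, 418, 3373],
    'lat':  [40.703799, 40.702240, 40.675070],
    'long': [-74.008387, -73.982578, -73.987752],
})

# Distance of every dock from a target dock (Old Slip & Front St)
target = docks[docks['dock_id'] == 337].iloc[0]
docks['dist_km'] = haversine_km(target['lat'], target['long'],
                                docks['lat'], docks['long'])

# Neighbors within an illustrative 0.5 km radius, excluding the dock itself
nearby = docks[(docks['dist_km'] < 0.5) & (docks['dock_id'] != 337)]
print(docks[['dock_id', 'dist_km']])
```

On the full station list, averaging `utilization` over each dock's neighbor set would show whether critical docks sit in critical neighborhoods or next to spare capacity.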

In [81]:
# checking how big every dock is
# since dock capacities can change over time, we take each dock's max observed capacity

df_all_max_capacity = df_all[['dock_id','tot_docks']].groupby('dock_id')['tot_docks'].max().to_frame().reset_index()
df_all_max_capacity
Out[81]:
dock_id tot_docks
0 72 39
1 79 33
2 82 27
3 83 62
4 116 39
5 119 19
6 120 19
7 127 31
8 128 30
9 143 24
10 144 19
11 146 39
12 147 33
13 150 31
14 151 33
15 152 29
16 153 55
17 157 23
18 161 35
19 164 47
20 167 45
21 168 47
22 173 51
23 174 30
24 195 45
25 212 28
26 216 23
27 217 39
28 223 33
29 224 31
... ... ...
661 3453 19
662 3454 24
663 3455 35
664 3456 22
665 3457 46
666 3458 40
667 3459 34
668 3461 33
669 3462 39
670 3463 37
671 3464 57
672 3466 34
673 3467 42
674 3469 23
675 3472 39
676 3474 36
677 3476 25
678 3477 33
679 3478 43
680 3479 45
681 3480 18
682 3481 22
683 3486 30
684 3489 31
685 3490 31
686 3491 24
687 3492 27
688 3493 31
689 3494 25
690 3495 26

691 rows × 2 columns

In [82]:
# saving it as a pickle file in case I need it later

df_all_max_capacity.to_pickle('/Users/SilviaRuiz/dsi_tasks/capstone/citibike/pickle_files/pickle_df_all_max_capacity.pkl')
In [83]:
df_full_max_capacity = df_all_max_capacity[df_all_max_capacity['dock_id'].isin(full_priority_docks_list)]
df_full_max_capacity
Out[83]:
dock_id tot_docks
111 337 37
182 418 23
557 3335 28
569 3347 21
595 3373 27
602 3381 21
619 3399 24
626 3407 23
638 3419 27
639 3420 21
In [85]:
plt.title("all docks vs. full docks capacity")
sns.distplot(df_all[df_all['in_service'] == 1].groupby('dock_id')['tot_docks'].mean(), label='all docks')
sns.distplot(df_full_max_capacity['tot_docks'], label='full docks')
plt.legend()
Out[85]:
<matplotlib.legend.Legend at 0x1a1ef38160>

The full docks are not particularly small compared to all the docks in the distribution, but they are not on the larger side either. This means that if each of these docks increased its capacity, it would still fall within the normal range of CitiBike dock sizes.
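One way to make "not particularly small, not on the larger side" concrete is to compute where each full dock's capacity falls in the overall capacity distribution. A sketch using the ten full-dock capacities from the table above; the overall sample here is a small invented stand-in for `df_all_max_capacity['tot_docks']`:

```python
import pandas as pd

# Capacities of the ten priority full docks (from df_full_max_capacity above)
full_caps = pd.Series([37, 23, 28, 21, 27, 21, 24, 23, 27, 21])

# Hypothetical sample standing in for all docks' max capacities
all_caps = pd.Series([19, 21, 23, 24, 27, 28, 30, 33, 35, 37, 39, 45, 55, 62])

# Empirical percentile of each full dock's capacity within the overall distribution
pct = full_caps.apply(lambda c: (all_caps <= c).mean())
print(pct.round(2).tolist())
```

Percentiles clustered in the middle of the distribution would support the claim that these docks could grow and still look like ordinary CitiBike stations.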

6.4 Create df with priority full docks and graph

In [86]:
df_all_full_priority = df_all[(df_all['dock_id'].isin(full_priority_docks_list)) & (df_all['date'] == '2017-06-13')] 
df_all_full_priority.head()
Out[86]:
dock_id dock_name date hour minute pm avail_bikes avail_docks tot_docks _lat _long in_service status_key utilization mil_test mil_min_test year month day day_of_week
368965 337 Old Slip & Front St 2017-06-13 1 37 0 3 33 37 40.703799 -74.008387 1 1 0.081081 1 1.616667 2017 6 13 Tuesday
368966 337 Old Slip & Front St 2017-06-13 2 42 0 3 33 37 40.703799 -74.008387 1 1 0.081081 2 2.700000 2017 6 13 Tuesday
368967 337 Old Slip & Front St 2017-06-13 3 51 0 3 33 37 40.703799 -74.008387 1 1 0.081081 3 3.850000 2017 6 13 Tuesday
368968 337 Old Slip & Front St 2017-06-13 4 55 0 3 33 37 40.703799 -74.008387 1 1 0.081081 4 4.916667 2017 6 13 Tuesday
368969 337 Old Slip & Front St 2017-06-13 6 0 0 3 33 37 40.703799 -74.008387 1 1 0.081081 6 6.000000 2017 6 13 Tuesday
In [87]:
# exporting to csv and saving as pickle file
df_all_full_priority.to_csv(path_or_buf='/Users/SilviaRuiz/dsi_tasks/capstone/citibike/df_all_full_priority.csv')
df_all_full_priority.to_pickle('/Users/SilviaRuiz/dsi_tasks/capstone/citibike/pickle_files/pickle_df_all_full_priority.pkl')
In [ ]:
# r script
# will have to figure out how to animate with gganimate in a later version

'''
# 2. importing libraries
library(htmltools)
library(ggmap)
library(gganimate)
library(gapminder)
library(sp)
library(leaflet)
library(dplyr)
library(tidyverse) # what lets you import csv

# 3. plotting of full docks on Tuesday, June 13th

# 3.1 creating dataframe
df_all_full_priority <- read_csv("/Users/SilviaRuiz/dsi_tasks/capstone/citibike/df_all_full_priority.csv",
                                  col_types = cols(
                                    line = col_integer(), # had to manually insert "line"
                                    dock_id = col_double(),
                                    dock_name = col_character(),
                                    date = col_character(),
                                    hour = col_integer(),
                                    minute = col_integer(),
                                    pm = col_integer(),
                                    avail_bikes = col_integer(),
                                    avail_docks = col_integer(),
                                    tot_docks = col_integer(),
                                    lat = col_double(), # had to manually change name from _lat
                                    long = col_double(), # had to manually change name from _long
                                    in_service = col_integer(),
                                    status_key = col_integer(),
                                    utilization = col_double(),
                                    mil_test = col_double(),
                                    mil_min_test = col_double(),
                                    year = col_integer(),
                                    month = col_integer(),
                                    day = col_integer(),
                                    day_of_week = col_character()
                                  ))
head(df_all_full_priority)

# 3.2 building spatial df and setting up colors
spatial_df_full <- df_all_full_priority
coordinates(spatial_df_full) <- ~ long + lat

pal <- colorNumeric("RdGy", domain = NULL, reverse = TRUE)

# 3.3 creating base map, and passing in df
leaflet(spatial_df_full) %>%
  addTiles() %>%
  setView(-74.00, 40.71, zoom = 12) %>%
  addProviderTiles("Thunderforest.OpenCycleMap") %>%
  addCircleMarkers(data = spatial_df_full,
                   radius = ~ntile(utilization,7),
                   color = ~pal(utilization),
                   label = ~dock_name,
                   #labelOptions = labelOptions(noHide = T),
                   stroke = FALSE, fillOpacity = 0.5)
'''
In [4]:
#full_priority_docks_map.html

IFrame('full_priority_docks_map.html', width=900, height=450)
Out[4]:

Conclusion

  • The system as a whole is generally underutilized
  • However, some docks experience moments at critical capacity far more often than others
  • For docks that spend time at 0% utilization, many vary with commuting times. i.e., they are empty during the work day but full (or above 50% utilization) during non-work hours. In other words, these are docks in residential areas within biking distance of major/large offices.
    • For these docks, increasing dock capacity will help alleviate the empty periods by having more bikes on dock to begin with
  • For docks that spend time at 100% utilization, many are full for a large portion of the day. This means the docks are not big enough for the traffic they generate. They are also not among the larger docks.
    • For these docks, increasing the number of docking points per station will help alleviate time at capacity

Next Steps

  • Animate graphs over time with gganimate to understand interplay between docks
  • Estimate how many additional bikes per station are needed
  • Analyze docks around "critical docks" to see if they experience similar issues (and can be prevented ahead of time)
  • Model financially what it costs to increase dock capacity, and study how that correlates or drives membership or user engagement
In [ ]:
# saving as html
!jupyter nbconvert --to html capstone_presentation_ruizs2018.ipynb